gensim topic model

Install the LTP tokenizer

pip install pyltp

Download the latest model files and put them on the machine, e.g. in ~/data/ltp_data

import os
from pyltp import Segmentor

segmentor = Segmentor()
# the native loader does not expand "~", so expand the path explicitly
segmentor.load(os.path.expanduser("~/data/ltp_data/cws.model"))
text = "我爱看电影"
print(list(segmentor.segment(text)))
segmentor.release()

Save the segmented documents as tokenized.txt, one document per line, with the words separated by spaces.
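The step above can be sketched as follows. This is a minimal sketch: the `docs` list is a hypothetical stand-in for your own corpus, here already segmented with spaces, where in practice you would run each raw document through the LTP segmenter shown earlier.

```python
# Write tokenized.txt: one document per line, tokens separated by spaces.
docs = ["我 爱 看 电影", "我 爱 学习"]  # placeholder for your segmented corpus

with open("tokenized.txt", "w") as f:
    for doc in docs:
        tokens = doc.split()  # with real data: list(segmentor.segment(doc))
        f.write(" ".join(tokens) + "\n")
```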

Train the LDA model

#!/usr/bin/env python
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora

# one document per line, tokens separated by spaces
texts = []
with open('tokenized.txt') as f:
    for line in f:
        texts.append(line.strip().split())

dictionary = corpora.Dictionary(texts)
print(dictionary)
# drop tokens in fewer than 10 documents or in more than 10% of all documents
dictionary.filter_extremes(no_below=10, no_above=0.1)
print(dictionary)
dictionary.save("text.dict")

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('corpora.mm', corpus)
corpus = corpora.MmCorpus('corpora.mm')

print("Starting Training LDA Model...")
# pass id2word so the trained model can print human-readable topics
lda = LdaMulticore(corpus, num_topics=100, id2word=dictionary, workers=11)
lda.save("lda.model")